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We consider the model 

V = xr + c, 

Z = X + E, 

where the random vector y £ R" and the random n x p matrix Z 
axe observed, the n x p matrix X is unknown, H is an n x p random 
noise matrix, f £ R" is a noise independent of H, and 6* is a vector of 
unknown parameters to be estimated. The matrix uncertainty is in 
the fact that X is observed with additive error. For dimensions p that 
can be much larger than the sample size n, we consider the estimation 
of sparse vectors 9* . Under matrix uncertainty, the Lasso and Dantzig 
selector turn out to be extremely unstable in recovering the sparsity 
pattern (i.e., of the set of non-zero components of 9*), even if the 
noise level is very small. We suggest new estimators called matrix 
uncertainty selectors (or shortly the MU -selectors) which are close 
to 9* in different norms and in the prediction risk if the restricted 
eigenvalue assumption on X is satisfied. We also show that under 
somewhat stronger assumptions, these estimators recover correctly 
the sparsity pattern. 



1. Introduction. We consider the model 

(1) y = Xd*+i, 

(2) Z = X + H, 

where the random vector y G R" and the random n x p matrix Z are 
observed, the n x p matrix X is unknown, H is an n x p random noise 
matrix, ^ G M" is a noise independent of H, and 9* = {61, . . . , 9*) is a vector 
of unknown parameters to be estimated. 

We will typically assume that 9* is s-sparse, i.e., that it has only s non- 
zero components, where 1 < s < p is some integer. The dimension p can 
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be much larger than the sample size n but we will typically have in mind 
the situation where the effective dimension s is much smaller than p and n. 
We will also assume that the elements of H are small. In this setting we will 
suggest estimators 9 = {6i, . . . ,6p) that under some assumptions recover 
6* with high accuracy in different norms, as well as under the prediction 
risk. We will also show that under somewhat stronger assumptions, these 
estimators recover correctly the sparsity pattern, i.e., the set of non-zero 
components of 9* . Our results follow the spirit of the now extensive literature 
on sparsity with £i-minimization (see, e.g., [Il[3llll[5l|6l[71|8l[IIl[l8l[I3[20l 
EH Ea Eg ESI 123 EH EH]). The main difference is in the presence of matrix 
uncertainty. The matrix X is not known and is observed with error. This 
leads us to new estimators, called matrix uncertainty selectors (or shortly 
the MC/-selectors), which are different from the Lasso and Dantzig selector 
(or their modifications) studied in those papers. 

In what follows, without loss of generality, we mainly assume that ^ and 
H are deterministic and satisfy the assumptions: 



n 



(3) 

(4) |SL < s 

for some e > 0, (5 > (a modification of Q is also used in some cases). Here 
\-\^ stands for the maximum of components norm. If ^ and H are random, 
conditions ([3| and Q can be guaranteed with a probability close to 1 under 
natural assumptions that we discuss below; we also indicate the correspon- 
ding values of e and 5. So, the results that we prove for deterministic ^ and 
H are extended in a trivial way to random ^ and H satisfying these assump- 
tions. The difference is only in the fact that the results hold on the random 
event of high probability where ([S]) and (|4| are satisfied. The setting with 
random X is covered in a similar way. We only need to consider random X 
for which the restricted eigenvalue (RE) assumption or the Coherence as- 
sumption (see below) hold with high probability. Examples of such random 
X are discussed in the literature |8l EI] • 

We introduce two versions of M[/-selectors. The first one is designed for 
the case ^ = 0, i.e., for the problem of solving a large system of linear equa- 
tions with deterministic or random noise in the matrix. This MC/-selector 
is defined as a solution of the minimization problem 

min{|0|i : 9ee, \y - Z9\^ < 6\9\i} 

where C RP is a given set characterizing the prior knowledge about 9. 
Here and below \x\g, q > 1, denotes the £g-norm of x G M'^ whatever is 
d> 1. 
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The second version of M?7-selector is defined as a solution of the mini- 
mization problem 



(5) 



min{|0|i : G e, 



n 



z^(y - ze) 



< A|^|i + e}, 



where A > is a factor responsible for matrix uncertainty. If Q is assumed, 
we choose A depending on 5 so that A > for 5 > and A = for for 5 = 0. 
Note that if = and there is no matrix uncertainty, i.e., 5 = 0, this 
second M{7-selector becomes the Dantzig selector of [8] based on the data 
(y, Z). We mainly discuss the choice \{5) = (1 + 5)5, which corresponds to 
a noise satisfying Q. This can be also used for ,^ = by setting e = in 
the definition. Nevertheless, for ^ = we consider directly the first version 
of MC/-selector because it is simpler and achieves better error bounds than 
for A = (1 + 5)5. 

Note that using in model ([T])-([2| the Lasso or Dantzig selector with Z 
instead of the true X typically leads to satisfactory results for the prediction 
loss when the noise H is small enough. However, these methods are less 
efficient in estimation of 9* and they are especially unstable in selection of 
the sparsity pattern (cf. Section]?]). In particular, they become quite sensitive 
to the values of 9* . This is explained by the fact that the true 9* is no longer 
guaranteed to stay, with a probability close to 1, in the feasible set of the 
Dantzig selector (which is also the set containing all the Lasso solutions). 

The second MC/-selector differs from the Dantzig selector based on the 
data (y, Z) in that we "penalize more" by enlarging the feasible band for 
^Z'^{y — Z9) . Indeed, setting = MP, a Lasso type analog of this MU- 

oo 

selector can be defined as a solution of the convex minimization problem 



1 , 

mm i — \y 
n 



Z9\l + Xi\9\i + X2\9\l 



with some Ai,A2 > 0. To appreciate why there is a similarity, note that 
for 9 to achieve the minimum of such a convex criterion it is necessary and 
sufficient to have: 



(6) 



- Z9)\ = y + A2I0I1 sign(%) if 9^ + 0, 



\z^{y-Z9)\^\ < y + A2|^|i 



n 



if 9j = 0, 



where the index j designates the jth component of the corresponding vector 
and sign(^j) is the sign of 9j. Therefore, the set of possible solutions is 
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"tightly" contained in G MP : ^Z^{y-Ze) < A2|^|i + Ai/2| which is 

L CO J 



ogy is thus in the same spirit 



the feasible set of the MC/-selector ^ . The anaL 
as between the Lasso and the Dantzig selector. 

The results of this paper can be viewed in several perspectives. First, we 
can interpret them as a new approach to the inference in errors-in-variables 
models. The classical ways of treating these models via some versions of 
least squares or of the method of moments heavily depend on specific iden- 
tifiability constraints that are violated when p ^ n |16l [T2] . Our approach is 
free of such constraints and requires only a modest price, which is the spar- 
sity of the unknown vector of parameters. Also, on the difference from the 
results in the conventional errors-in-variables framework, we provide non- 
asymptotic bounds for the risks of the estimators and guarantee the finite 
sample variable selection property. 

The second perspective is an extension of the theory of £i-based sparse 
recovery beyond the restricted isometry /restricted eigenvalue conditions (cf. 
[H H]) that are known to be too strong. We show that small perturbations 
of the design matrix X that bring these conditions to failure are in fact not 
so dangerous, once the method of recovery is chosen in a proper way (cf. 
Remark 4 below). 

Finally, the third perspective is in developing simple and efficient tools 
of sparse recovery for specific applications. We mention here models with 
missing data, some financial models (portfolio selection, portfolio replica- 
tion) and inverse problems with unknown operator. They are presented in 
the next section. 

2. Examples of application. Here we explain how several examples 
of application can be described by model ([T])-([2]) with a sparse vector of 
parameters 6*. 

1. Models with missing data. Assume that the elements Zij of matrix 
Z satisfy 

(7) ^ij ~ -^ijVij 

where Xij are the elements of X and rjij are i.i.d. Bernoulli random variables 
taking value 1 with probability 1 — vr and with probability tt, < tt < 1. 
The data Xij is missing if r]ij = 0, which happens with probability vr. We 
are mainly interested in the case of small vr. In practice it is easy to estimate 
TT by the empirical probability of occurrences of zeros in the sample of Zij , 
so it is realistic to assume that vr is known. Note that we can rewrite ([7| in 
the form 

(8) Z'ij = Xij + ^'ij 
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where = - vr), ^-j- = X^{r\ij - E{riij))/{1 - vr), and E{-) denotes 

the expectation. Thus, we can reduce the model with missing data ([7| to 
the form (j2| with matrix H whose elements ^-^ are zero mean bounded 
random variables. In this case assumption (4) is fulfilled with 6 which is not 
necessarily small, whereas the theoretical bounds obtained below only make 
sense if 5 is small enough. Nevertheless, the variances of ^'^j are proportional 
to vr, and we will see in Section 6 that, by modifying assumption (4), we 
obtain bounds for MC/-selector that are small if vr is small. 

2. Portfolio selection. Brodie et al. [2] recently argued that clas- 
sical methods of portfolio selection are highly instable. As a remedy, they 
proposed an algorithm accounting for the sparsity of portfolio weights and 
studied its numerical performance. A different approach to sparse portfolio 
selection can be introduced in our framework. Recall that in the traditional 
Markowitz portfolio selection, the objective is to find a portfolio having 
minimal variance return for a given expected return. This is stated as the 
optimization problem 

(9) mm{d'^X9 : 0^ = /3, ^ 0^. = 1, j = 1, . . . ,p} 

j 

where = (6i, . . . ,6p) is the vector of weights with 6j representing the 
proportion of capital invested in the j-th asset, X and /i are the covariance 
matrix and the vector of expected returns of the different assets and /3 is 
the desired return of the portfolio. 

Using Lagrange multipliers, problem ^ is reduced to solution of the 
linear equation X6 = a for some vector a G depending on /i and X. 
However, neither the covariance matrix X, nor the mean n are available. 
Only their empirical (noisy) versions are observed. Instead of X we have a 
sample covariance matrix Z, and instead of a a vector of noisy observations 
y. Thus, we are in the framework of model ([T])-(|2]) with n = p (since X is 
a square matrix) . Direct substitution of noisy values Z and y instead of X 
and a leads, in general, to instability of the solution of the linear equation 
because the dimension p can be very high (often 500 assets or more) and X 
can be either degenerate or with a small minimal eigenvalue. The methods 
that we suggest below are robust to the variations both of the matrix X and 
of the right-hand side a. 

Another way of looking at sparse portfolio selection is to revise the very 
problem (joj) . Note that minimizing e^Xe, where X is the covariance matrix, 
is motivated by the fact that we would like to get the portfolio with smallest 
"dispersion" . This requirement looks quite natural as long as we remain in 
the world of the classical second order statistics reasoning. The problem ([9| 
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is similar in the spirit to "minimal variance unbiased estimation", an old 
concept which is known to have serious drawbacks. An alternative method 
would be to look for the sparsest portfolio with a given daily return j3 (we can 
also consider weakly or monthly returns). The problem can be formalized 
as follows. Let Xij be the return of the j'-th asset on day i. The matrix 
of returns X = {Xij)ij is typically observed with measurement error. This 
error can be due to an incomplete description of the assets. For example, 
the only available quantities for the investor are often reduced to the open, 
high, low and close prices, which leads in particular to asynchronous data 
(especially when one deals with prices from markets belonging to different 
time zones) and to an underestimation of the order book effects. Indeed, to 
take into account the liquidity costs, an investor should compute the returns 
having in mind the order of magnitude of the number of assets he may have 
in his portfolio. However, such an accurate computation is only possible for 
the very few investors having access to order book data. In the case of non 
standard assets such as hedge funds, the measurement error can be also due 
to uncertainty about the management costs, the rounding approximations 
used and the way the returns are computed. Thus, instead of X we in fact 
observe some other matrix Z = (Zij)ij. 

We are looking for the sparsest portfolio, i.e., a portfolio that solves the 
problem 

(10) min{|0|o: X9 = h^J^Oj = 1, j = 1, . . . ,p}, 

j 

where 6 is the vector of the proportions of the wealth invested in each asset, 
\6\o is the number of non-zero components of 9 and b G M" is the vector 
with all the components equal to /3. It is important to note that the spars- 
est portfolio does not necessarily contain a very small number of assets, in 



particular when p is large. The minimization problem (10) is NP-hard, and 



the standard way to approximate it is to consider its convex relaxation: 

(11) min{|0|i: X9 = h,Y,9, = l,j = 1, . . . ,p}. 

j 

This problem is already numerically solvable, but since X is observed with 
error, the solution can be unstable. We do not necessarily recover the sparsest 



solution if we directly plug Z instead oi X in (11). A stable alternative that 
we suggest below is given by solving 

(12) min{|^|i: \h - Z9\oo < 6\9\i, Y,9j = 1, j = 1, . . . ,p} 
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where d is an upper bound on the noise level in the matrix X. 

3. Portfolio replication. Replicating a portfoho, or at least finding 
the type of assets in a portfolio, has become a very challenging issue in 
the recent years, especially in the hedge funds context. Indeed, replicating 
a hedge fund portfolio means obtaining a Profit and Loss profile similar 
to those of the hedge fund without investing in it (and so avoiding the 
usual drawbacks of a hedge fund investment such as excessive fees, lack of 
transparency, lack of liquidity, lack of capacity, etc.). Replicating a portfolio 
can be done by retrieving the assets belonging to the portfolio. This problem 
can be formalized through model ([T])-(|2]). 

To fix ideas, suppose for example that we observe the daily returns yi, 
z = 1, . . . , T, of a portfolio. Moreover, assume that the proportion of capital 
invested in each asset of the portfolio is constant between day 1 and day T. 
Then, we theoretically have 



where p is the total number of different assets in the portfolio, Xij is the 
return of the j-th asset belonging to the portfolio on day i and 6j the pro- 
portion of capital invested in it. As pointed out in the previous example, it 
is natural to consider that the vector of the portfolio returns {yi)i and the 
matrix of the assets returns X = (Xij)ij are observed with measurement 
error. Note that in this setup we can also treat the case where yt and Xij 
are the absolute returns (differences between the close price and the open 
price), provided that we define 6j as the (constant) quantity of the j'-th asset 
in the portfolio. 

To solve our problem, we could formally consider that any existing asset 
or derivative can, in principle, belong to the portfolio. This is of course not 
realistic. However, it is reasonable to assume that the portfolio is rather 
sparse and that any asset used in the portfolio has a behavior which is quite 
close to those of an asset belonging to a restricted, given class of reference 
assets, especially if this restricted class can still be very large. For example, 
we will not put all the oil companies in the world in our restricted class. 
Nevertheless, we suppose that if an oil company is used in the portfolio and 
is not in the class, its returns profile will look like the returns profile of 
another oil company which belongs to the restricted class. Consequently, Z 
will be made from the daily returns of our reference assets. Indeed, for any 
asset in the portfolio, we will assume that either it belongs to the assets 
defining Z or it "resembles" one of the assets defining Z. Eventually, Z can 



p 
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be seen as a noisy measurement of X, and thus the problem is described by 
model ([T|-(|2]). A numerical illustration is given in Section [Tj 

4. Inverse problems with unknown operator. This setting has 
been recently discussed by several authors jlSl El [TOl [T71 |22]. A typical 
problem is to recover an unknown function / that belongs to a Hilbert 
space H based on a noisy observation Y of Af where A : H \s a linear 
operator and V is another Hilbert space. The observation Y with values in 
V can be written as 



(13) Y = Af + (: 

where C is a random variable (typically assumed Gaussian) with values in V . 
Let and {V'jl^i be complete orthonormal bases in H and V respec- 

tively. We can write / = X^j^i — Sj=i + with some coefficients 
9*j, where the integer p is chosen very large, so that one can consider the 
remainder term r £ H as negligible. Therefore, we can reduce the problem of 
estimating / to that of recovering the vector of coefficients 6* = {91, . . . , 9*). 
Introducing the scalar products Yi = (Y, ipi) and = ((", ipi) we obtain from 



(13) the following sequence of real-valued observations: 
p 

Y, = J2 Oj{A<Pj,ij,) + (Ar, i^,) i = 1, 2, 



If we consider here only the first n observations, assume that {Ar, tl^i) = 
and define the matrix X = ((^</'j, V'i)j=iv."j=i. ■••,?)' ^^"^ vectors y = 
{Yi, . . . , Yn), ? = (?i, ■ ■ • , Cn), then we get the linear model ([T]). As discussed 
in [151 13 Uni [IZl [22] ) it is rather frequent in the applications that the op- 
erator A is not known but its action on any given function in H can be 
observed with some noise. We emphasize that in those papers the noise H 
is supposed to be small. This is consistent with the strategy of perform- 
ing many repeated measurements of {A(j)j,^i) for each pair (i,j). Thus, we 
have access to observations of the matrix X = ((^(/"j, '0i)«=iv,"j=i. ■•■,?) with 
some small noise, and therefore we are in the framework of model ([T|-([2]). 
The results obtained in [151 13 [13 [13 l22] consider the case n = p and deal 
with non-degenerate matrices X. This framework is not always convenient, 
especially if n and p are very large. The approach that we develop in this 
paper is more general in the sense that, for example, if n = p we can treat 
degenerate matrices X that satisfy some regularity assumptions. We also 
cover the case p^ n, which is a useful extension because by taking a large 
p we can assure that the residual r is indeed negligible. 



October 15, 2009 



SPARSE RECOVERY UNDER MATRIX UNCERTAINTY 



9 



3. Sparse solution of linear equations with noisy matrix. In this 
section we consider the simplest case, ^ = 0. Thus, we solve the system of 
linear equations 

y = xe, 

where X is an unknown matrix such that we can observe its noisy values 

Z = X + H, 

where H satisfies Q. 

Let be a given convex subset of W. We will assume in this section that 
there exists an s-sparse solution Og oiy = XO such that Og ^ Q. Consider the 
estimator 9 of Og defined as a solution of the following minimization problem 

(14) min{|0|i: G 9, \y - Ze\^ < 5\e\i] . 



Clearly, ( 14 ) is a convex minimization problem. If = or if is a linear 
subspace of W or a simplex (the latter case is interesting, for example, in the 
context of portfolio selection), then (14) reduces to a linear programming 
problem. 

Note that under assumption ^ the feasible set of problem ( 14 1 : 
ei = {0Ge: \y-Ze\^<5\9\^} 
is non-empty. In fact, Og ^ Qi since 
(15) \y- Z9g\^ = \~eg\^<\r.\^\eg\r<5\eg\^. 



Thus, there always exists a solution 6 of (14). But it is not necessarily 
unique. We will call solutions of (14) the matrix uncertainty selectors (or 
shortly MC/-selectors). 

To state our assumptions on X we need some notation. For a vector 6 
and a subset J of {1, ... ,p} we denote by 9j the vector in W that has the 
same coordinates as 9 on the set of indices J and zero coordinates on its 
complement J'^. 

We will assume that the matrix X satisfies the following condition {re- 
stricted eigenvalue assumption [Ij): 

Assumption RE(s). There exists k> such that 

\XA\2 

min — > K 

A^O:|A,jc|i<|Aj|i ^n\Aj\2 

for all subsets J of {1, ... ,p} of cardinality \J\ < s. 
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A detailed discussion of this assumption can be found in [1^ . In particular, 
it is shown in [1 that the restricted eigenvalue assumption is more general 
than several other similar assumptions used in the sparsity literature |1H 
[HI [26] • One of such assumptions is the coherence condition \TD^ that has the 
following form. 

Assumption C. All the diagonal elements of the matrix ^' = X'^X/n are 
equal to 1 and all its off-diagonal elements ^ j, satisfy the coherence 

condition: maxj^j < p with some p < 1. 

Note that Assumption C with p < {3as)^^ implies Assumption RE(s) 
with K = y^l — 1/a (cf. [Ij or Lemma 2 in [18j. 

We now state the main result of this section. 



Theorem 1. Assume that there exists an s-sparse solution 9s ^ @ of 
the equation y = X9. Let hold Then for any solution 6 of (14) we have 
the following inequalities. 

(16) ^\x{e-es)\l < 46^\9\l 

n 

(a) If Assumption RE(s) holds, then 

(17) \e-9,\, < ^\eh. 

(Hi) If Assumption RE(2s) holds, then 

(18) \e-9s\2 < -\o\i. 

K 

(iv) If Assumption C holds with p < a > 1, then 

(19) \e-es\oo < 2(1 + — ^=^U|^|i. 

Y 6y'sa[a — Ij / 

Proof. Set A = 9 - 6s and J = J{9s), where J{9) denotes the set of 
non-zero coordinates of 9. Note that 

(20) IXAI2 = \Z9-y-E9\2 

< ^/n(^\Z9 - y\oo + \'E9\ao 

< {6\9\i + \EU9\i 

< 2^^/n\9\l, 
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which proves (16). 

Next, by the standard argument (cf. Lemma [T] below or, e.g., [3 [1]) we 
have 

|Ajc|i < |Aj|i. 

Thus, 



(21) 



<2|Aj|i <2Vs|Aj|2 < 



K\ n 



^A|2, 



where the last inequality follows from Assumption RE(s). Combining (20) 
and ([2l| we get ([it]). 

To prove (18), we introduce the set of indices J\ corresponding to those s 
coordinates of A outside J = J{Q^ which are largest in absolute value (we 
assume without loss of generality that 2s < p) . Define Jqi = J U Ji . By a 
simple argument that does not use any assumption (cf., e.g.,[8l [1] and the 
papers cited therein) we get: 



(22) 
Thus, 

so that 



|A 



1 

■^01 I ^ 



< 



|A 



|Ajcj2 < < |Aj|2 < |Aj„j2, 



lAlo < 2|A 



^01 |2- 



Now, by Assumption RE(s) and (20), 
lAjoib 



1 25, 
< — ^P^A 2 < — 
K\ n K 



and hence (18) follows. 



We finally prove (19). Note first that 
1 



(23) \^{B-Q. 



sj\oo — 



< 



n ' 
1 

— max 
n i<j<p 

^ \X{9 - es)\2 max |xq)|2 



xJ.)X(^ - 9, 



\X{9-es)\2 



where xq) denotes the jth column of X and the last equality uses the fact 
that |x(j')|2 = since IxQ-jl^/n are the diagonal elements of X"^X/n. There- 
fore, by (pOl), 



(24) 



\^{e-9,)\^<26\0\i. 
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Now, since the jth component of ^{9 — 9s) is 



{^{9 -9s)), = {9, 



0, 



where 9 si is the ith component of 9s, we obtain 

(25) \9-9s\oc<2d\9\i + p\9-9s\i. 

Recah that Assumption C with p < {3as)~^ imphes Assumption RE(s) with 
At = \/l — 1/a (cf. [1 or Lemma 2 in flSQ. Thus, we can apply (17) with 
this value of k to bound \9 — 9s\i in (25) which finally yields (19). 

Remark 1. We can replace by \9s\i in all the inequalities of Theorem 

Remark 2. It is straightforward to deduce a bound for \9 — 9s\q for any 
1 < 9 < 2 from the bounds (17) and (18), as it is done, e.g., in pQ. 

Under the assumptions of part (iv) of Theorem [T] we get 

(26) l^-^sloo < aia)6\9\i. 



is a constant. Based on this we can define 



where C,(a) = 21+ , ^ ^ 
the thresholded estimator 9 = {9i, . . . ,9p) where 



(27) 



9j = 9jI{\9j\>T}, j = l,...,p, 



with the data-dependent threshold r = C*(a;)(5|^|i for some q > 1. Here 
/{•} denotes the indicator function. It is useful to note that since the MU- 
selector 9 is, in general, not unique, the thresholded estimator 9 is neither 
necessarily unique. 

Our next result shows that the thresholded estimator 9 recovers the spar- 
sity pattern, and moreover it recovers the signs of the coordinates of s- 
sparse solution 9s (this property is sometimes called the sign consistency, cf. 
mi]). We define 



sign I 



-1 9 < 0, 

a 9 = 0, 

1 if 6* > 0. 
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Theorem 2. Assume that 6s £ @ is an s-sparse solution of y = X9, 
and that C {0 g : < a} for some a > 0. Let and Assumption 
C hold with p < (3as)~^ for some a > 1. If 



(28) 

then 
(29) 



mill \6sj\ > C*(a)5a, 



sign 9j = sign Osj, j = l,...,p, 



for all 9j in (21) such that 6 is an MU -selector defined in (14)- 



Proof. For j J{9s) we have Osj = and thus, by (26), \6j\ = \9j — 9sj\ < 
CJa)6\9\i = T. Therefore, 9j = for j J{9s). For j G J{9s) note that 
(26) imphes: \9j — 9sj\ < C*(Q)5a. This and assumption (28) yield that 9j 
has the same sign as 9sj- 

Remark 3. Note that, under Assumption C with p < {3as)~^ as required 
in Theorem[2] the g-sp arse solution is unique, cf., e.g., [18 , p. 93, so that the 
right hand side of (29) is uniquely defined. The estimator 9 is not necessarily 



unique, nevertheless Theorem [2] assures that the sign recovery property (29) 
holds for all versions of 9. 



4. Sparse recovery for regression model with unknown design 

matrix. We consider now the gener al model and assume that it 

holds with an s-sparse vector of unknown parameters 9* = 9s G 0. Because 
of the presence of noise ^ that is typically not small we need to change the 
definition of MC/-selector. We now define the Mt/-selector as a solution 
of the minimization problem 



(30) min{|6'|i : 6* G 9, 

Note that if (5 = and 6 
selector of [8]. 



n 



Z^{y - Z6) 



< {l + 6)6\9\i + e}. 



this MC/-selector becomes the Dantzig 



Similarly to (14), the problem (30) is a convex minimization problem and 



it reduces to linear programming if = M^, is a linear subspace of MP or 
a simplex. 

Throughout this section we will assume for simplicity that the matrix X 
is normalized, so that all the diagonal elements of the Gram matrix ^ = 
X'^X/n are equal to 1. Extensions to general matrices are straightforward. 
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it only modifies the constants in the expression (1 + (5)(5|0|i + e in (30) and 
in the theorems. 

Note that under assumptions (|3]) and Q the feasible set of the convex 
problem (30) is non-empty: 



92 = G e : 



n 



z^{y - ze) 



<{l + 5)5\e\i+e] 



To prove this let us show that the true vector 0* = Og belongs to 02- In fact, 
by 



(31) 



n 



Z^{y - Zds) 



< 



z^{x9, + i-zes 



n 

< £ + 



+ 



1 



n 



-z'^^e. 



n 



Next, note that, by Q and by the fact that all the diagonal elements of 
n are equal to 1, the columns zq) of matrix Z satisfy |z(j)|2 < V^(l + 
6). Therefore, arguing as in (23) we obtain 



(32) 



1 rr 

-Z^EOs 
n 



< 



1 + '^ 



n 



<{i + 6)\Ees\^<{i + 5)5\e,\i. 



This and (31) yield 



n 



Z^{y - ZOs) 



< {l + 5)5\9s\i + e. 



Since we also assume that Og belongs to 6, the fact that Og G O2 is proved. 
Thus, there always exists a solution of (30 ). Of course, it is not necessarily 
unique. 

Theorem 3. Assume that model (^-(^ holds with an (unknown) s- 
sparse parameter vector 9* = 6s £ & and that all the diagonal elements of 
X'^X/n are equal to 1. Let and hold. Set 

u = 2{2 + 5)5\es\i + 2e. 



Then for any solution 6 of (30) we have the following inequalities. 

4z^s 



(i) Under Assumption RE(s): 
(33) 
(34) 



|i 



< 



-\x{e-es)\l < ^ 
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(ii) Under Assumption RE(2s): 



(35) 



I? < 



s, Vl<g<2, 



(in) Under Assumption C with p < a > 1: 
(36) |^-^.|oo 



3a + 1 
< — — v. 



3(a - 1) 

Proof. Set A = ^ - 61^ and J = J {OX Note first that ^ and the fact that 
6 belongs to the feasible set 02 of (30) imply 



(37) 



1 rp 

-X^XA 


< 


n 


CO 




+ 



n 



z'^iy - ze) 



+ 



E^X{e - 9s 



n 



1 T 

-z^C 


+ 


-z^Ee 




n 


OD 


n 


oo 



+ 2e + 


1 ^ - 
-Z^EO 


+ 




n 


oo 



n 



x{e 



Now, 
(38) 



where are the columns of E and we used that Ix^^-j I2 = yra by assumption 
on X'^X/n, and \£,{j)\2 < Sy/n by ml. This implies 



(39) 



1, 



n 



X{9-9s) 



< 



7s 11 



X 



n 



< 



Next, as in (32), we obtain 
(40) 



-Z' E9 
n 



< {1 + S)6\9\i. 



We now combine (37), (39) and (40) to get 
1 



(41) 



-X'^XA 



n 



<2e + 2(1 + 6)5\9\i + 5\9 - 9s\i < v. 



Taking into account (41), the proof of (33), (34) and (35) follows the same 
lines as the proof of Theorem 7.1 in [Ij where we should set r = u m = s. 
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We now prove (36). We proceed as in the proof of (19) in Theorem [T] 
with the only difference that now we replace (24) by (41). Thus, instead of 
(|25|) we obtain 



(42) 



<v + p\e-es 



1- 



Next, recall that Assumption C with p < {3as)^^ implies Assumption RE(s 
with = 1 — 1/a (cf. [1 or Lemma 2 in 
with q = 1 and = 1 — 1/a we obtain 
theorem. 



18 



36 



). Using in (42) the bound (35) 
. This finishes the proof of the 



Note that, in contrast to Theorem [T| the bounds of Theorem [3] do not 
depend on \6\i but on the unknown \9s\i (cf. definition of i^). This drawback 
can be corrected for small values of 6, as shows the next result. 

Theorem 4. Let the assumptions of Theorem^hold and S < Set 

i^i = 2{l + 5)S\e\i + 2e. 



Then for any solution 6 of (30) we have the following inequalities. 



7s 1 



< 



(i) Under Assumption RE (s): 
(43) 

(44) \\x{e-es)\l < 

(a) Under Assumption RE(2s): 

(45) 1^ - 



4z^i\9 

1^2 



Avis 



A5s 



45s 



Vl< g < 2, 



(Hi) Under Assumption C with p < a > 1, and 5 < 5^.' 



(46) 



< 



2(3a+ 1) 
3(a - 1) 



Proof. We use the same notation as in the proof of Theorem [3j From (41 ) 
and the fact that |Ajc|i < |Aj|i we obtain 



(47) 



1 



n 



< 



lAI 



n 



< \A\i{vi + 5\A\i) 

< 2|Aj|i(z.i+25|Aj|i) 

< 2V^|Aj|2(i^i + 2V^5|Aj|2). 
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Similar arguments as for (21) easily yield the inequality 



(48) 



A 



J|2 



< 



1 



4bs 



and (43). In the same way, (45) deduces from (48) following the analo- 
gous part of the proof of Theorem 7.1 in [1] where we should set r = 
(z^i/2) (1 - ^bsjK^Y^ and m = s. 



Finally, to get the sup-norm inequality (46) we proceed as in the proof of 
(19) in TheoremJTlor in that of (36) in Theorem [3| with the only difference 
that instead of ( |25| ) we use the bound 

l^-^sloo <Z^l + (p + 5)|^-0.|l 



that follows from (41) and the fact that b < This finishes the proof of 
Theorem IH 



As in Section [3j we now define a thresholded estimator d = (^i, . . . , ^p) 
by the formula 



(49) %=%/{|e,|>ri} 

where the threshold is given either by 

3a -h 1 



J = 1, • • • 



(50) ri 
for a > 1, a > 0, or by 

(51) ri = 



3(a 

2(3a-hl) 
3(a- 1) 



2e + 2(l + 5)(5|^|: 



for a > 1. Note that the threshold (51 ) is completely data-driven if e and b 
are known. 

The next theorem shows that under some assumptions the thresholded 
estimator defined in (49) recovers the sparsity pattern, and moreover it 
recovers the signs of the coordinates of s-sparse solution Qg ■ 

Theorem 5. Assume that model (^-(^ holds with s-sparse vector of 

unknown parameters 9* = 9s ^ & and that and Assumption C hold 

with p < (305)^"*^ for some a > 1. Let either Q C {9 £ W : \9\i < a} for 

r I 2 

some a > and the threshold ri is given by (50), or 6 < |j and the threshold 
Ti is given by (51). If 

(52) Tain\9sj\>Ti, 
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then 

(53) sign Oj = sign Osj, j = l,...,p, 



for all 9j in (49) where 6 is a MU -selector defined in (30). 



Proof. It goes along the same lines as the proof of Theorem [2j 

We can make here the same remarks as in Section [3] about the non- 
uniqueness of the estimators. Indeed, 6 is not necessarily unique but The- 



orem [5j assures the the sign recovery property (53) holds for all versions 

of e. 

Remark 4. The argument of this section can be applied with minor 
modifications to the model 

y = ^r + e, 

Z = X + E. 

This is no longer the errors-in-variables setting, but just the usual regression 
setting where X is some "nominal" design matrix and H can be viewed as its 
perturbation. The results of this section suggest that small perturbations of 
the design matrix X beyond the restricted eigenvalue condition are in fact 
not so dangerous, once the method of recovery is chosen in a proper way. 
Indeed, such perturbations lead to the extra terms in the bounds propor- 
tional to the £i-norm of the solution. Roughly speaking, our bounds suggest 
that the MC/-selector is robust with respect to possible violations of the the 
restricted eigenvalue condition, provided that the perturbations are small 
enough and the ^i-norm of the true 6 is reasonably bounded. This offers a 
possible way of relaxing the strong conditions usually imposed in the con- 
text of ^i-penalized sparse estimation. Note that another way to do it can 
be found in |13l [T^j suggesting a computationally feasible method of sparse 
estimation with no assumption on X. However, the oracle inequalities of 
|13t [H] hold only for the prediction risk. 

5. Approximately s-sparse solutions. The results of the previous 
sections can be easily generalized to the setting where the true 0* is arbitrary, 
not necessarily s-sparse. This might be of interest in the context of inverse 
problems with unknown operator, as discussed in the Introduction. Then 
the bounds will involve a residual term, which is a difference between 9* 
and its s-sparse approximation Og. In particular, we can take 9s as the best 
s-sparse approximation of 0*, i.e., the vector that coincides with 9* in the 
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s coordinates with largest absolute values and has other coordinates that 
vanish. 

We will use the following slightly strengthened version of Assumption 
RE(s) where we only increase a numerical constant in the definition of the 
set over which the minimum is taken (cf. [1]). 

Assumption RE(s,2). There exists k> such that 

\XA\2 

mm — — > K 
A^0;|Ajc|i<2|A,j|i y/n\Aj\2 



for all subsets J of {1, ■ ■ ■ ,p} of cardinality \J\ < s. 

It is easy to check that Assumption C with p < for some q > 1 implies 
Assumption RE(s,2) with = 1 - 1/a (cf. [Ij). 

We now state the main result of this section. 

Theorem 6. Assume that there exists a solution 9* £ Q of the equation 
y = X9. Let hold. Then for any solution 6 of (I4) we have the following 
inequalities, 
(i) 

(54) ^\x{e-e*)\l < ^s'^\o\l 

(a) If Assumption RE(s,2) holds, then 

(55) \e-e*\i < ^^\e\i + 6 mm \e*jc\i. 

K J-\J\<s 

(Hi) If Assumption C holds with p < a > 1, then 

(56) l^-rioo < 2(1 + -— ^=^)<5|^|i + -^ inin 

\ 5y/sa{a - 1) / 5as J:\j\<s 

Proof. Set A = — 9* and let J C {1, . . . ,p} be an arbitrary set of indices 
such that I J| < s. First, note that (54) is already proved in Theorem[T] since 
(|20| is vahd with A = 9 -9*. 

We will use the following elementary fact (cf., e.g., O El [71 [8]) that we 
state for convenience as a lemma. 

Lemma 1. Let 9 be a solution of the problem 

min{|0|i : 9ee'} 
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where Q' is a subset ofW. Let 9* be any element of Q' and J any subset of 
{1, . . . ,p}. Then for A = 9 — 6* we have 



(57) 

Proof. 



|Ajc|i < |Aj|i + 2|e} 



|l + |0jc|l 



> 



l^li > l^li = I^jIi + I^j^Ii 

\Aj + e*j\i + \Ajc + e*,\i 

l^jli ~ |Aj|i + |Ajc|i — \9*jc\i. 



To prove (55) consider separately the following two cases: (a) 2|0jc|i < 
|Aj|i and (b) 2|0jc|i > |Aj|i. In case (a) we use (57) to obtain |Ajc|i < 
2|Aj|i. Therefore, by Assumption RE(s,2) and (54), 

|Aj|2 < ^|XA|2 < — l^li. 
K.\/n K 



This and (57) imply that, in case (a), 
(58) 



A 



< 2|Aj|i+2|0}c|i <2V^|Aj|2 + 2 



In case (b) we immediately deduce from (57) that |A|i < 6|^jc|i. Combining 
this with ( [58]) w e obtain (55). 

To prove (56) note that the argument leading to (25) is applicable here 
with 9* in place of 9s. Thus, 

(59) |^-r|oo <2(5|^|i + /)|^-r|i. 



Now, as mentioned above. Assumption C with p < 



a > 1, implies 



Assumption RE(s, 2) with k = 1 — l/g . Using (55) with this value of k to 
bound \9 — 9s\i in (59) we arrive at (56). This proves the theorem. 

Note that under Assumption C we can also bound the £2 norm of the 
difference 9 — 9*, as well as all its ir norms with r > 1. However, Assumption 
C is rather restrictive. For instance, it is not valid for Toeplitz matrices ^ or 
for matrices X with independent standard Gaussian entries (for the latter 
case. Assumption RE is assured with overwhelming probability if s is of a 
smaller order than n/logp). The next theorem shows that we can bound 
correctly the I2 norm \9 — 9*\2 under the following condition which is weaker 
than Assumption C but somewhat stronger than Assumption RE. 
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(60) 



Assumption RE'(s,2). There exist k > and ci > such that 

\XA\yn + Cla\AJ\2/^/s 



mm 

A^0:|Ajc|i<2|Aj|i+a 



|Aj|i 



for all a >0 and all subsets J of {1, ... ,p} of cardinality \ J\ < s. 

Note that Assumption RE(s, 2) is a special case of ( |60| ) corresponding 
to a = 0. Note also that Assumption RE'(s,2) is satisfied if the restricted 
isometry assumption [3 El E] holds with the isometry coefficient close enough 
to 1. This is not hard to show following the lines of [5|. 

Theorem 7. Assume that there exists a solution 9* £ Q of the equation 
y = XO. Let and Assumption RE' (2s, 2) hold. Then for any solution 9 
of (14 ) we have 

(61) \e-e*\2 < -1^11+^4+^) min ■ 

K \ K J J:\J\<s 



Proof. Set, as before, A = 9 — 6* and let J C {1, ... ,p} be an arbitrary 
set of indices such that \J\ < s. We first note that (57), (22) and the fact 
that |Aj|i < ^|Aj|2 imply 



(62) 



^Jgj2 < |Aj|2 + 



Consider separately the cases 2 1 I i/\/s < |Ajoj2 and 2|0}c|i/\/s > |Aj(,j2- 
(a) In the case 2\9Jc\l/^/s < |AjqJ2 we have |AjcJi < 2|AjqJi. Also, 
\Joi\ < 2s by the definition of Jqi. Therefore, using Assumption RE'(2s,2) 
with a = and ( 54 ) , we get 

|Aj„j2<^|AA|2<-|^|i. 



This and ( 62 ) imply 



(63) |A|2 < |Ajoj2 + |Aj|2 + ^|e}c|i 

V s 



< 



4:5. 



1 + 



Thus, (61) is proved in the case 2\6Jc\l/^/s < |Aj(,j2- 

(b) It remains to prove (61) in the case 2\9jc\i/\/~s > |Aj(,j2- This condi- 
tion and ( 62 ) immediately yield 

4 

Ajc 2 < 
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SO that 

(64) |A|2 < |Ajoj2 + ^|^}c|i. 



Next, from (57) we easily get 

|AjcJi < |AjoJi + 2|6'}c|i. 



Therefore, using Assumption RE'(2s,2) with a = 2|0jc|i and (54), we find 



where we used that 2\9jc\i/ ^/s > |Aj(,j2- The last display and (64) imply 
that (61) holds in the case 2\6jc\i/^/s > |AjqJ2- 

6. Random noise. If ^ and H are random and conditions ([3| and (|4| 

are satisfied with a probability close to 1, then all the bounds in the above 
theorems remain valid with the same probability. This holds in different 
situations under natural assumptions that we briefly discuss in this section. 

First, it is not hard to see that if ^ is normal with zero mean and covariance 
matrix a^I where / denotes the identity matrix, and we take 



'logp 



n 



(65) e = Aa^ 

for some ^ > (1 + S)\/2, then condition (jsj) holds with probability at least 
p is very large, this probability is very close to 1. Similar 
remark holds for subgaussian ^. 

For more general ^ we can guarantee condition ([3| only with a larger value 
of £ and with a probability that is not as close to 1 as in the gaussian case. 
For example, if the components of ^ are independent zero mean random 
variables with uniformly bounded variances, E{^f) < < oo, i = 1, . . . , n 
and if the elements Xij,i = 1, . . . ,n,j = 1, . . . ,p, of matrix X satisfy 

1 " 

— max iXjA'^ < c 

for some constant c, then condition ([3| holds with probability at least 1 — 
O (^^^f) [IE] - In particular, we can take 

V n 
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and then condition ([3| holds with probabihty at least 1 — ((logp)^'^). 

For the choice of 6 in condition Q we can consider the examples related 
to portfolio selection and to inverse problems with unknown operator, cf. 
Section 2. In both examples we have repeated measurements. The matrix 
Z is either the average of several observed matrices with mean X, or the 
empirical covariance matrix, with X defined as the corresponding popula- 
tion covariance matrix (in the latter case p = n). Then the threshold 6 in 
condition Q can be determined in the same spirit as e in condition (|3]). We 
omit further details. 

Finally, consider the model with missing data discussed in Section 2. In 
this example direct application of condition Q leads to bounds which are 
too loose. Indeed, 6 can be of the order of |X|oo. However, we argue that 
the M[/-selector of the form ^ with suitable A still satisfies good bounds 
if the probability tt that an entry of X is not observed remains small. This 
needs a refinement of our argument for the particular setting. We sketch it 
now. Note first that under the assumptions of Theorem [3] for a deterministic 
matrix X and for Zij = Xij where are defined in Section 2, we have 
with probability close to 1 when n is large: 



(66) 
(67) 
(68) 



'-E-X 
n 

n 
1 







<'^1 


oo 


n 




oo 


- diag(S^ 










oo 





diag(H^H) 



n 



< Ctt, 



where H is the matrix with entries diag(S S) denotes the diagonal ma- 
trix having the same diagonal elements as H-^S, C > is a constant, and 



61,62 > are small if n is large. Indeed, (66) and (67) follow from the stan- 
dard properties of zero mean subgaussian variables, while (68) is due to the 
fact that the expectations of the diagonal elements of ^^-^H are proportional 
to vr. 



We now observe that under assumptions (66) - (67) the constant {1 + 6)6 



in (32) can be replaced by 61 + 62 + Ctt. This motivates the use of MU- 
selector (|5| with A = 61 + 62 + 011. For such MC/-selector we have an analog 
of Theorem [3] if we replace assumption (j4]) by assumptions (66) - (67). The 
only difference is in the form of u which now becomes a linear combination 
of 61, 62 and vr. This new value of u is small for n large enough and small vr. 
In conclusion, the MC/-selector ^ with suitable A achieves good theoretical 
bounds provided that vr is small enough and n is large. This is confirmed by 
simulations in the next section. 
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7. Numerical experiments. We present here three illustrative numer- 
ical applications. The first two are based on simulated data and the last one 
on real data. 

7.1. Censored matrix. We begin with a model where we only observe 
censored elements of the matrix X. More precisely, for a positive censoring 
value t, instead of Xij, we observe 

(69) Zij = XijI{\Xij\ <t} + t{signXij)I{\Xij\ > t}. 

Experiment. 

— We take a matrix X of size 100 x 500 (n = 100, p = 500) which is the 
normalized version (centered and then normalized so that all the diagonal 
elements of the associated Gram matrix are equal to 1) of a 100 x 500 matrix 
with i.i.d. standard Gaussian entries. 

— For a given integer s, we randomly (uniformly) choose s non-zero elements 
in a vector 9 of size 500. The associated values are equal to 0.5. We will take 
s = 1,2,3,5,10. 

— We set y = X9 + ^, where a normal random vector with zero mean and 
covariance matrix a^I where a = 0.05/1.96 (so that for an element of ^, the 
probability of being between —0.05 and 0.05 is 95%). 



— We compute the matrix Z following (69) with t = 0.9. 

— We run a linear programming algorithm to compute the solution of (i 
where we optimize over Q = M.^^. The value of e is chosen following (( 
with A = {1 + (5)\/2. We note here that in the simulations below the choice 
of £ is not crucial because the terms with 5 in the definition of the estimator 
are of larger order of magnitude. Varying e within a sufficiently wide range 
does not essentially modify the simulation results. The choice of parameter 
6 is done the following way. 

Choice of 5. The choice of 5 in practice is quite crucial. A very small value 
of 5 means that the matrix uncertainty is not taken into account whereas 
a too large value of 6 means that we overestimate this uncertainty. In both 
situations the resulting estimator exhibits poor behavior. Consequently, in 
practice it is important to select 5 within a reasonable range of values. We 
suggest to choose the range of candidate 5 with the "elbow" rule. We plot 
the number of retrieved non-zero coefficients a as function of 6. Then we 
consider that a value of 5 can be chosen only if the plot is (or begins to be) 
flat around it. Usually such a plot is highly decreasing at the beginning and 
then stabilizes, cf. Figure 1. Following this, we take the values in the flat 
zone <5 = 0.05,0.75,0.1 for s = 1,2,3,5 and <5 = 0.01,0.05,0.1 for s = 10 
(the plot for s = 10 suggests to start with smaller values for 6). 
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Fig. 1. Average number of non zero coefficients in the model with censored matrix for 
s = 1 and s = 10. 



— We also compute the Lasso estimator with Mallows' Cp choice of the 
tuning parameter (we use the Lars R-package of T. Hastie and B. Efron) 
and the Dantzig selector of [8j, with the same value e. Moreover, we compute 
the thresholded versions of the estimators (T-Lasso, T-Dantzig, T-6). More 
precisely, the retrieved coefficients whose absolute values are smaller than 
20% of the true value of the non-zero coefficients (i.e., smaller than 0.1) are 
set to zero. 

— For all the considered estimators 6 we compute the error measures 

Erri = \9- 9\l and Erra = \X{9 - 9)\l 

We also record the retrieved sparsity pattern, which is defined as the set of 
the non-zero coefficients of 9. 

— For each value of s we run 100 Monte Carlo simulations. 
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Results. Tables 1-5 present the empirical averages and standard deviations 
(in brackets) of Err i, Err2, of the number of non-zero coefficients in 9 (Nbi) 
and of the number of non-zero coefficients in belonging to the true sparsity 
pattern {Nb2). We also present the total number of simulations where the 
sparsity pattern is exactly retrieved (Exact) . Note that here and in the next 
numerical examples when a coefficient belonging to the sparsity pattern is 
retrieved it has systematically the correct sign. 





Erri 


Err2 


Nbi 


Nb2 


Exact 


Lasso 


0.0679 


12.33 


95.20 


1 







(0.0128) 


(2.016) 


(2.245) 


(0) 




T-Lasso 


0.0271 


2.712 


1 


1 


100 




(0.0086) 


(0.8615) 


(0) 


(0) 




Dantzig 


0.0399 


3.982 


56.92 


1 







(0.0076) 


(0.9880) 


(5.591) 


(0) 




T-Dantzig 


0.0260 


2.599 


1 


1 


100 




(0.0068) 


(0.6860) 


(0) 


(0) 




6 = 0.05 


0.0122 


1.231 


1.16 


1 


85 




(0.0027) 


(0.2783) 


(0.393) 


(0) 




T-6 = 0.05 


0.0122 


1.224 


1 


1 


100 




(0.0028) 


(0.2816) 


(0) 


(0) 




6 = 0.075 


0.0064 


0.649 


1 


1 


100 




(0.0017) 


(0.1715) 


(0) 


(0) 




T-6 = 0.075 


0.0064 


0.649 


1 


1 


100 




(0.0017) 


(0.1715) 


(0) 


(0) 




6 = 0.1 


0.0023 


0.2330 


1 


1 


100 




(0.0008) 


(0.0843) 


(0) 


(0) 




T-6 = 0.1 


0.0023 


0.2330 


1 


1 


100 




(0.0008) 


(0.0843) 


(0) 


(0) 





Tab. 1. Results for the model with censored matrix, s = 1. 
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Erri 


Err2 


Nbi 


Nb2 


Exact 


Lasso 


0.1262 


23.57 


96.47 


2 







(0.0218) 


(3.813) 


(1.670) 


(0) 




T-Lasso 


0.0456 


4.688 


2.290 


2 


77 




(0.0194) 


(2.157) 


(0.5881) 


(0) 




Dantzig 


0.0792 


8.000 


68.79 


2 







(0.0149) 


(2.159) 


(4.901) 


(0) 




T-Dantzig 


0.0404 


4.0558 


2.04 


2 


97 




(0.0143) 


(1.612) 


(0.2416) 


(0) 




5 = 0.05 


0.0064 


0.6654 


2.15 


2 


89 




(0.0039) 


(0.4247) 


(0.4769) 


(0) 




T-6 = 0.05 


0.0063 


0.6535 


2 


2 


100 




(0.0039) 


(0.4314) 


(0) 


(0) 




6 = 0.075 


0.0015 


0.1535 


2 


2 


100 




(0.0016) 


(0.1637) 


(0) 


(0) 




T-S = 0.075 


0.0015 


0.1535 


2 


2 


100 




(0.0016) 


(0.1637) 


(0) 


(0) 




5 = 0.1 


0.0059 


0.5410 


2 


2 


100 




(0.0045) 


(0.3773) 


(0) 


(0) 




T-S = 0.1 


0.0059 


0.5410 


2 


2 


100 




(0.0045) 


(0.3773) 


(0) 


(0) 




Tab. 2. Results for the model with censored matrix, 


s = 2. 




Erri 


Err2 


Nbi 


Nb2 


Exact 


Lasso 


0.1834 


34.54 


96.91 


3 







(0.0326) 


(6.156) 


(1.407) 


(0) 




T-Lasso 


0.0776 


8.832 


4.28 


3 


25 




(0.0306) 


(3.907) 


(1.068) 


(0) 




Dantzig 


0.1209 


12.27 


73.83 


3 







(0.0259) 


(3.556) 


(3.945) 


(0) 




T-Dantzig 


0.0597 


6.108 


3.40 


3 


66 




(0.0251) 


(2.877) 


(0.6164) 


(0) 




6 = 0.05 


0.0055 


0.5287 


3.19 


3 


85 




(0.0059) 


(0.4952) 


(0.5038) 


(0) 




T-6 = 0.05 


0.0053 


0.5209 


3 


3 


100 




(0.0058) 


(0.5064) 


(0) 


(0) 




6 = 0.075 


0.0148 


1.296 


3.05 


3 


95 




(0.0110) 


(0.7843) 


(0.2179) 


(0) 




T-5 = 0.075 


0.0148 


1.302 


3 


3 


100 




(0.0109) 


(0.7935) 


(0) 


(0) 




5 = 0.1 


0.0415 


3.791 


3.02 


3 


98 




(0.0177) 


(1.1552) 


(0.1400) 


(0) 




T-5 = 0.1 


0.0415 


3.793 


3 


3 


100 




(0.0177) 


(1.159) 


(0) 


(0) 





Tab. 3. Results for the model with censored matrix, s = 3. 
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Erri 


Err2 


Nbi 


Nb2 


Exact 


Lasso 


0.3183 


57.68 


97.57 


5 







(0.0596) 


(10.51) 


(1.089) 


(0) 




T-Lasso 


0.1693 


20.22 


10.31 


5 







(0.0551) 


(7.408) 


(2.331) 


(0) 




Dantzig 


0.2225 


22.68 


81.04 


5 







(0.0429) 


(6.275) 


(3.967) 


(0) 




T-Dantzig 


0.1159 


12.08 


7.87 


5 


3 




(0.0430) 


(5.174) 


(1.6891) 


(0) 




5 = 0.05 


0.0596 


4.544 


5.52 


5 


63 




(0.0417) 


(2.457) 


(0.8423) 


(0) 




T-6 = 0.05 


0.0592 


4.613 


5.08 


5 


92 




(0.0414) 


(2.535) 


(0.2712) 


(0) 




6 = 0.075 


0.1327 


11.11 


5.12 


5 


91 




(0.0566) 


(3.059) 


(0.4069) 


(0) 




T-6 = 0.075 


0.1327 


11.14 


5.03 


5 


97 




(0.0565) 


(3.097) 


(0.1705) 


(0) 




5 = 0.1 


0.2331 


20.29 


5.06 


5 


95 




(0.0698) 


(3.154) 


(0.2764) 


(0) 




T-6 = 0.1 


0.2371 


20.61 


4.97 


4.95 


98 




(0.0792) 


(3.933) 


(0.2628) 


(0.21) 





Tab. 4. Results for the model with censored matrix, s = 5. 





Erri 


Err2 


Nbi 


Nb2 


Exact 


Lasso 


0.7181 


100.7 


97.98 


10 







(0.1426) 


(19.38) 


(0.8364) 


(0) 




T-Lasso 


0.5560 


55.09 


27.02 


10 







(0.1499) 


(14.57) 


(3.781) 


(0) 




Dantzig 


0.5625 


55.11 


87.71 


10 







(0.1383) 


(13.33) 


(3.672) 


(0) 




T-Dantzig 


0.4203 


40.41 


22.33 


9.98 







(0.1467) 


(12.91) 


(3.212) 


(0.1400) 




(5 = 0.01 


0.3142 


24.91 


31.6 


10 







(0.1614) 


(7.068) 


(4.079) 


(0) 




T-6 = 0.01 


0.2760 


20.41 


14.13 


9.95 







(0.1612) 


(7.539) 


(1.677) 


(0.2598) 




6 = 0.05 


0.9679 


56.18 


14.11 


9.33 


2 




(0.3688) 


(14.65) 


(2.403) 


(0.8724) 




T-6 = 0.05 


1.0187 


62.38 


10.07 


8.23 


16 




(0.4088) 


(17.90) 


(1.226) 


(1.5,35) 




6 = 0.1 


1.392 


98.89 


10.31 


7.94 


14 




(0.2821) 


(11.27) 


(1.514) 


(1.391) 




T-6 = 0.1 


1.483 


108.1 


6.92 


5.95 


37 




(0.3003) 


(12.99) 


(1.324) 


(1.519) 





Tab. 5. Results for the model with censored matrix, s = 10. 

Our first observation is that using the Lasso estimator or the Dantzig se- 
lector (i.e., ignoring the matrix uncertainty) has severe consequences. These 

methods exhibit erratic behavior aheady for the minimal sparsity s = 1. 
Though their sets of non-zero components steadily include the relevant set, 
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they are much too large and the results are very far from the correct se- 
lection. We also see that the M[/-selector strictly improves upon the Lasso 
estimator and the Dantzig selector for all the considered error criteria and 
values of s. In particular, for 5 = 0.1 and s = 1,2,3,5, it almost system- 
atically retrieves the sparsity pattern and the two error measures remain 
very small. This is obviously no longer the case for the bigger value s = 10. 
However, note that the M[7-selector remains quite satisfactory in terms of 
selecting the sparsity pattern since the average number of retrieved coeffi- 
cients is about 10 and the average number of retrieved coefficients is about 
8. Thresholding the coefficients logically improves the retrieved sparsity pat- 
terns of the Lasso estimator and Dantzig selector. Nevertheless, in most of 
the cases the MC/-selector outperforms their thresholded versions as well. 
This fact is even more significant because we simulate with a threshold which 
has been well chosen knowing the true value of the non-zero coefficients. In 
practice, choosing a relevant threshold is a very intricate question since the 
order of magnitude of the non-zero coefficients is typically unknown. On the 
other hand, for the MC/-selector thresholding can be avoided. Indeed, its 
effect is not significant, especially when s is small. This is due to the fact 
that the original (non-thresholded) M[/-selector is already very accurate in 
recovering the sparsity pattern. 

Finally, note that the good results for the MU-selectoi are not due to 
the fact that we optimize over Q = M.^^ instead of = M.^^^. In particular, 
taking 6 = leads to the same kind of results as those for the Dantzig 
selector. 



7.2. Model with missing data. We consider now the model with missing 
data as defined in Section 2. We design the numerical experiment in the 
same way as in Section 7.1 except that the observed matrix Z is now given 
by ((tJ) with tt = 0.1. 
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Results. 





Erri 


Err2 


Nbi 


Nb2 


Exact 


Lasso 


0.0212 


2.606 


94.59 


1 







(0.0105) 


(1.232) 


(3.256) 


(0) 




T-Lasso 


0.0011 


0.111 


1 


1 


100 




(0.0010) 


(0.1019) 


(0) 


(0) 




Dantzig 


0.0109 


1.114 


64.24 


1 







(0.0072) 


(0.7360) 


(11.06) 


(0) 




T-Dantzig 


0.0011 


0.1097 


1 


1 


100 




(0.0010) 


(0.1030) 


(0) 


(0) 




6 = 0.05 


0.0041 


0.3376 


8.55 


1 


6 




(0.0029) 


(0.2218) 


(5.087) 


(0) 




T-6 = 0.05 


0.0022 


0.2271 


1 


1 


100 




(0.0012) 


(0.1203) 


(0) 


(0) 




6 = 0.075 


0.0039 


0.3449 


3.99 


1 


29 




(0.0021) 


(0.1625) 


(3.090) 


(0) 




T-S = 0.075 


0.0031 


0.3133 


1 


1 


100 




(0.0011) 


(0.1124) 


(0) 


(0) 




5 = 0.1 


0.0047 


0.4490 


1.94 


1 


61 




(0.0019) 


(0.1356) 


(1.605) 


(0) 




T-5 = 0.1 


0.0044 


0.4451 


1 


1 


100 




(0.0012) 


(0.1268) 


(0) 


(0) 




Tab. 6. Results for the model with missing data, s 


= 1, 




Erri 


Err2 


Nbi 


Nb2 


Exact 


Lasso 


0.0425 


4.675 


96.02 


2 







(0.0162) 


(1.786) 


(2.074) 


(0) 




T-Lasso 


0.0047 


0.4481 


2.02 


2 


98 




(0.0037) 


(0.3318) 


(0.1400) 


(0) 




Dantzig 


0.0269 


2.695 


74.39 


2 







(0.0134) 


(1.4823) 


(5.774) 


(0) 




T-Dantzig 


0.0046 


0.4330 


2 


2 


100 




(0.0035) 


(0.3194) 


(0) 


(0) 




5 = 0.05 


0.0131 


1.033 


6.76 


2 


13 




(0.0078) 


(0.4688) 


(3.572) 


(0) 




T-6 = 0.05 


0.0106 


1.018 


2 


2 


100 




(0.0055) 


(0.4692) 


(0) 


(0) 




6 = 0.075 


0.0167 


1.517 


3.20 


2 


48 




(0.0071) 


(0.4557) 


(1.489) 


(0) 




T-5 = 0.075 


0.0160 


1.525 


2.01 


2 


99 




(0.0064) 


(0.4584) 


(0.099) 


(0) 




5 = 0.1 


0.0247 


2.351 


2.27 


2 


77 




(0.0074) 


(0.4634) 


(0.5264) 


(0) 




T-5 = 0.1 


0.0245 


2.362 


2 


2 


100 




(0.0070) 


(0.4731) 


(0) 


(0) 





Tab. 7. Results for the model with missing data, s = 2. 
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Erri 


Err2 


Nbi 


Nb2 


Exact 


Lasso 


0.0721 


6.828 


96.89 


3 







(0.0251) 


(2.116) 


(1.449) 


(0) 




T-Lasso 


0.0134 


1.225 


3.12 


3 


88 




(0.0093) 


(0.8126) 


(0.3250) 


(0) 




Dantzig 


0.0496 


4.844 


80.45 


3 







(0.0204) 


(2.075) 


(4.693) 


(0) 




T-Dantzig 


0.0117 


1.119 


3.05 


3 


95 




(0.0082) 


(0.8438) 


(0.2180) 


(0) 




5 = 0.05 


0322 


2.591 


6.8 


3 


10 




(0.0138) 


(0.7730) 


(2.942) 


(0) 




T-6 = 0.05 


0.0293 


2.726 


3.04 


3 


96 




(0.0119) 


(0.8735) 


(0.1959) 


(OJ 




6 = 0.075 


0.0439 


4.0308 


3.96 


3 


50 




(0.0137) 


(0.7988) 


(1.333) 


(0) 




T-S = 0.075 


0.0432 


4.1098 


3.01 


3 


99 




(0.0130) 


(0.8505) 


(0.0994) 


(0) 




6 = 0.1 


0.0653 


6.217 


3.21 


3 


84 




(0.0160) 


(0.8355) 


(0.5156) 


(0) 




T-6 = 0.1 


0.0651 


6.235 


3 


3 


100 




(0.0158) 


(0.8500) 


(0) 


(0) 




Tab. 8. Results for the model with missing 


data, s 


= 3, 




Erri 


Err2 


Nbi 


Nb2 


Exact 


Lasso 


0.1302 


9.993 


97.24 


5 







(0.0499) 


(2.657) 


(1.097) 


(0) 




T-Lasso 


0.0418 


3.331 


5.65 


5 


56 




(0.0326) 


(2.056) 


(0.899) 


(0) 




Dantzig 


0.1005 


9.371 


84.36 


5 







(0.0443) 


(4.113) 


(4.009) 


(0) 




T-Dantzig 


0.0365 


3.356 


5.38 


5 


74 




(0.0275) 


(2.454) 


(0.7454) 


(0) 




6 = 0.05 


0.1033 


8.301 


8.19 


5 


14 




(0.0384) 


(1.713) 


(2.591) 


(0) 




T-6 = 0.05 


0.1001 


8.900 


5.14 


5 


87 




(0.0362) 


(2.146) 


(0.3746) 


(0) 




6 = 0.075 


0.1485 


13.25 


5.96 


5 


48 




(0.0415) 


(1.716) 


(1.272) 


(0) 




T-6 = 0.075 


0.1477 


13.53 


5.05 


5 


95 




(0.0402) 


(1.999) 


(0.2179) 


(0) 




5 = 0.1 


0.2133 


19.60 


5.31 


5 


79 




(0.0494) 


(1.708) 


(0.7835) 


(0) 




T-6 = 0.1 


0.2131 


19.70 


5.03 


5 


97 




(0.0488) 


(1.904) 


(0.1705) 


(0) 





Tab. 9. Results for the model with missing data, s = 5. 
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Erri 


Err2 


Nbi 


Nb2 


Exact 


Lasso 


0.4746 


18.04 


98.01 


10 







(0.1702) 


(4.334) 


(0.7549) 


(0) 




T-Lasso 


0.4358 


13.74 


37.44 


10 







(0.1710) 


(3.357) 


5.317) 


(0) 




Dantzig 


0.4229 


38.13 


90.77 


10 







(0.1684) 


(15.95) 


(2.853) 


(0) 




T-Dantzig 


0.3862 


34.84 


32.77 


10 







(0.1690) 


(16.35) 


(5.184) 


(0) 




6 = 0.01 


0.2891 


10.78 


47.38 


10 







(0.1285) 


(2.059) 


(5.351) 


(0) 




T-6 = 0.01 


0.2725 


12.26 


19.93 


10 







(0.1271) 


(2.719) 


(3.311) 


(0) 




6 = 0.05 


0.7710 


45.89 


18.02 


9.91 







(0.2755) 


(6.212) 


(3.720) 


(0.2861) 




T-6 = 0.05 


0.7719 


48.36 


13.41 


9.73 


6 




(0.2807) 


(7.134) 


(2.015) 


(0.6611) 




5 = 0.1 


1.182 


84.80 


13.42 


9.37 


6 




(0.2983) 


(8.477) 


(2.324) 


(0.8204) 




T-5 = 0.1 


1.196 


87.42 


10.81 


8.78 


23 




(0.3104) 


(9.304) 


(1.521) 


(1.338) 





Tab. 10. Results for the model with missing data, s = 10. 

We see that again the Lasso and Dantzig selector are highly unstable in 
selecting the sparsity pattern, whereas the MC/-selector does a good job. 
The thresholded estimators T-Lasso and T-Dantzig are also quite accurate 
in retrieving the sparsity pattern, except for s = 10. However, in all the 
cases the MC/-selector does it better. The MC/-selector with 6 = 0.05 (or 
S = 0.01 for s = 10) has the smallest error measures Erri and Err2, whereas 
the sparsity pattern is better retrieved for 6 = 0.1. This reflects a trade-off 
between estimation and selection. Smaller values of 6 lead to smaller errors 
Erri and Err2 , whereas larger values of 6 lead to a very accurate recovery of 
the sparsity pattern. The error measures Erri and Err2 of the thresholded 
estimators T-Lasso and T-Dantzig are somewhat smaller than those of the 
MC/-selector, except for s = 10. Note, however, that we report the results 
for the performance of T-Lasso and T-Dantzig with a threshold based on 
the knowledge of the true coefficients. 

7.3. Portfolio replication. We now present a "toy" application based on 
financial data. We apply model ([T|-(|2]) and the M[/-selector in the context 
of portfolio replication as described in Section 2. We take the data of the 
open and close prices of p = 491 assets in the Standard and Poors S&iF 
500 index for the n = 251 trading days of 2007. These data are provided by 
the Yahoo Finance Database. The assets we use are those available for the 
whole year. 
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Experiment. Let p°j and pf^- denote the open and close prices of the j-th. 
asset for the i-th day. Our experiment is the fohowing. 

— We consider the matrix X with entries {X)ij = p^- — p°- and define X as 

the normahzed matrix obtained from X. 

— We pick s assets to build our portfolio. The coordinate of each chosen 
asset in the vector 6 G M^^^ is set to l/s and the other coordinates to 
(note that in practice, if the j-th asset is in the portfolio, it means that the 
corresponding coordinate of is l/{s(Tj) where dj is the empirical standard 
deviation of its absolute returns). 

— We consider the following six portfolios: 



s = 2 


s = 3 


Boeing, Goldman Sachs 
Boeing, Coca Cola 
Boeing, Ford 


Boeing, Google, Goldman Sachs 
Boeing, Google, Coca Cola 
Boeing, Google, Ford 



Tab. 11. Initial portfolios. 



— We compute y = XO + ^ where ^ is the same noise as in Section 7.1 In 
practice the noise ^ can reflect an uncertainty about the management costs, 
a lack of transparency in the definition of the returns of the portfolio or 
some rounding approximations. 

— We consider a matrix uncertainty of the following type: Z is obtained 
from X by replacing one of its columns by the zero column. The column 
corresponds to one of the assets in the portfolio. The goal of this manip- 
ulation is to mimic the fact that in practice not all the existing assets are 
in our restricted class. One of the assets in the portfolio does not belong to 
the restricted class since the corresponding column of X is suppressed. Of 
course, this asset cannot be retrieved. We suppress the column associated 
to an asset different from Boeing and Google. 

— We solve (30) with such a matrix Z, with 5 = 0.5 and e chosen as in 
Section 7.1 We also compute the Lasso estimator and the Dantzig selector. 

Results. We write B for Boeing and G for Google. The initial portfolios 
and the portfolios retrieved by the MC/-selector are presented in Table 8. 



Initial Portfolio 


Retrieved Portfolio 


B, Goldman Sachs 
B, Coca Cola 
B, Ford 


B, Morgan Stanley, Merrill Lynch 
B, Pepsico 
B, General Motors 


B, G, Goldman Sachs 
B, G, Coca Cola 
B, G, Ford 


B, G, Morgan Stanley, Merrill Lynch 
B, G 

B, G, General Motors 
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Tab. 12. Retrieved portfolios, MU-selector. 

The results are very satisfying. Indeed, the algorithm almost always finds 
the correct number of assets in the portfolio and the discarded asset is 
replaced by one or two assets that are intuitively close to it. Moreover, 
if one takes 5 = 0.4, then for the initial portfolio [Boeing, Google, Coca 
Cola] the retrieved portfolio becomes [Boeing, Google, Pepsico], whereas 
the other results remain the same. Finally, note that the Lasso estimator 
and the Dantzig selector (usual Dantzig selector or M?7-selector with 6 = 0) 
systematically output more than 20 assets in the retrieved portfolio. 
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